In this report I will highlight the firms our supervisors should focus on, using key metrics that I explain as they are introduced. I have kept all of my code in the report for audit and reusability purposes, as conclusions may change when new data arrives. I will focus this piece on the three main characteristics stated: Firm Size, Changing Business Profile and Outliers.
I have used a combination of the openxlsx and tidyverse packages to read the xlsx file into an R-readable format. Below I have produced a summary of all the variables.
| firm | year | value_type | eof_for_scr_m | equity_m | gwp_m | gross_bel_m | gross_claims_incurred_m | gross_combined_ratio | gross_expense_ratio | nwp_m | net_bel_m | net_combined_ratio | net_expense_ratio | pure_gross_claims_ratio | pure_net_claims_ratio | scr_m | scr_coverage_ratio | total_assets_m | total_liabilities_m | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:9120 | Length:9120 | Length:9120 | Min. : -162.25 | Min. : -123.610 | Min. : -19.78 | Min. : -167.82 | Min. :-189.936 | Min. : -14064 | Min. :-16112.4 | Min. :-17754.10 | Min. : -182.382 | Min. :-5468706 | Min. :-4495420 | Min. :-24817.8 | Min. :-33721.0 | Min. : 0.000 | Min. : -3 | Min. : -207.4 | Min. : -1487.1 | |
| Class :character | Class :character | Class :character | 1st Qu.: 5.98 | 1st Qu.: 5.798 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.000 | 1st Qu.: 0 | 1st Qu.: 0.0 | 1st Qu.: 0.00 | 1st Qu.: 0.000 | 1st Qu.: 0 | 1st Qu.: 0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 1.794 | 1st Qu.: 1 | 1st Qu.: 14.4 | 1st Qu.: 3.8 | |
| Mode :character | Mode :character | Mode :character | Median : 32.09 | Median : 34.276 | Median : 15.47 | Median : 10.48 | Median : 3.392 | Median : 0 | Median : 0.1 | Median : 7.95 | Median : 4.502 | Median : 0 | Median : 0 | Median : 0.2 | Median : 0.1 | Median : 14.533 | Median : 2 | Median : 113.0 | Median : 63.0 | |
| NA | NA | NA | Mean : 455.68 | Mean : 480.865 | Mean : 853.94 | Mean : 259.07 | Mean : 116.696 | Mean : 1859 | Mean : 273.7 | Mean : 689.12 | Mean : 164.284 | Mean : 172 | Mean : 1370 | Mean : 463.0 | Mean : 388.1 | Mean : 308.437 | Mean : 1301158 | Mean : 6333.0 | Mean : 6363.4 | |
| NA | NA | NA | 3rd Qu.: 165.65 | 3rd Qu.: 176.937 | 3rd Qu.: 198.96 | 3rd Qu.: 159.34 | 3rd Qu.: 79.463 | 3rd Qu.: 1 | 3rd Qu.: 0.3 | 3rd Qu.: 125.95 | 3rd Qu.: 93.771 | 3rd Qu.: 1 | 3rd Qu.: 0 | 3rd Qu.: 0.5 | 3rd Qu.: 0.5 | 3rd Qu.: 95.191 | 3rd Qu.: 3 | 3rd Qu.: 955.4 | 3rd Qu.: 655.7 | |
| NA | NA | NA | Max. :41636.30 | Max. :26705.042 | Max. :74078.64 | Max. :19292.07 | Max. :6844.014 | Max. :3978266 | Max. :575064.8 | Max. : 75526.67 | Max. :11351.609 | Max. : 2445529 | Max. : 2691212 | Max. :985011.3 | Max. :828809.4 | Max. :22788.359 | Max. :999302877 | Max. :553549.9 | Max. :494498.5 | |
| NA | NA | NA | NA’s :2620 | NA’s :2620 | NA’s :2620 | NA | NA | NA | NA | NA’s :2620 | NA | NA | NA | NA | NA | NA’s :2620 | NA’s :2620 | NA’s :2620 | NA’s :2620 |
From this summary we can see that each column is of the correct data type and that there is significant variance across the numeric variables. The dataset now looks like this:
| firm | year | value_type | eof_for_scr_m | equity_m | gwp_m | gross_bel_m | gross_claims_incurred_m | gross_combined_ratio | gross_expense_ratio | nwp_m | net_bel_m | net_combined_ratio | net_expense_ratio | pure_gross_claims_ratio | pure_net_claims_ratio | scr_m | scr_coverage_ratio | total_assets_m | total_liabilities_m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Firm 1 | 2016 | mean_value | 484.0712 | 249.6647 | 9.353606 | 0.1177988 | 0.0011030 | 0.1350101 | 15.45025 | -3550.820 | 1.978919 | 14.25977 | 8.202612 | 3.506938 | 3.436862 | 404.2644 | 11177400.6440186 | 12737.29 | 8034.337 |
| Firm 1 | 2016 | median_value | 0.0000 | 0.0000 | 0.000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.00000 | 0.000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.0000000 | 0.00 | 0.000 |
| Firm 1 | 2016 | sd_value | 1080.2952 | 555.8116 | 20.915299 | 0.2634062 | 0.0024665 | 0.3018918 | 34.54781 | 7939.875 | 4.424999 | 31.88581 | 18.341598 | 7.841752 | 7.685056 | 903.9626 | 24993427.5609242 | 28476.09 | 17965.323 |
| Firm 1 | 2016 | time_series_value | 2416.5598 | 1243.9250 | 46.768030 | 0.0000000 | 0.0000000 | 0.0000000 | 0.00000 | -17754.100 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 2021.3219 | 0.1625197 | 63676.87 | 40171.683 |
| Firm 1 | 2017 | mean_value | 484.0712 | 249.6647 | 9.353606 | 0.1177988 | 0.0011030 | 0.1350101 | 15.45025 | -3550.820 | 1.978919 | 14.25977 | 8.202612 | 3.506938 | 3.436862 | 404.2644 | 11177400.6440186 | 12737.29 | 8034.337 |
I have set up the data table so that only the time_series_value rows of value_type vary over time; all of the other value_type rows stay constant across years. This makes it straightforward to wrangle the data to find the top/bottom n firms, or the most/least varying firms, by variable, as you will see in my visualisation code.
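As a quick illustration of that wrangling, here is a minimal sketch (assuming `df` is the tibble previewed above, with one row per firm, year and value_type) that ranks firms by the median value of a chosen metric; the helper name and the gwp_m example are illustrative.

```r
library(dplyr)

# rank firms by the median value of a chosen metric and keep the top n
top_n_firms_by_median = function(data, metric, n = 10) {
  data %>%
    filter(value_type == "median_value") %>%
    # the median rows repeat across years, so one row per firm is enough
    distinct(firm, .keep_all = TRUE) %>%
    arrange(desc(.data[[metric]])) %>%
    slice_head(n = n) %>%
    select(firm, all_of(metric))
}

# e.g. the ten firms with the largest median gross written premium
top_n_firms_by_median(df, "gwp_m", n = 10)
```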
# read the firm identifiers (column X1) from the given sheet of the workbook
pull_unique_firms = function(wb, sheet_n){
  openxlsx::readWorkbook(
    xlsxFile = wb,
    sheet = openxlsx::sheets(wb)[sheet_n])['X1'] %>%
    dplyr::pull()
}

# compare the firms reported in sheet 2 against those in sheet 1
firms_sheet1_vec = pull_unique_firms(wb, 1)
firms_sheet2_vec = pull_unique_firms(wb, 2)

# number of firms present in sheet 2 but absent from sheet 1
length(setdiff(firms_sheet2_vec, firms_sheet1_vec))
There are some caveats worth considering with my data-centric approach to resource allocation; these are:
To consider firm size I will look at:
As we want to consider both the magnitude and the consistency of the variables, I will use the median value to select the highest-rated firms and plot them against the rest.
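A sketch of how that selection and plot could look, assuming `df` as above; the choice of nwp_m is illustrative and the exact aesthetics of the report’s chart may differ.

```r
library(dplyr)
library(ggplot2)

# firms whose median nwp_m sits in the top 20% of the population
top_firms = df %>%
  filter(value_type == "median_value") %>%
  distinct(firm, .keep_all = TRUE) %>%
  filter(nwp_m >= quantile(nwp_m, 0.8, na.rm = TRUE)) %>%
  pull(firm)

# plot the top 20% of firms against the rest over time
df %>%
  filter(value_type == "time_series_value") %>%
  mutate(group = if_else(firm %in% top_firms, "Top 20%", "Rest")) %>%
  ggplot(aes(x = year, y = nwp_m, group = firm, colour = group)) +
  geom_line(alpha = 0.6) +
  labs(y = "Net written premium (m)", colour = NULL)
```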
From this graph, where I have only considered the top 20% of firms by median value, we can see five firms that are distinctly larger than the rest. It is worth highlighting, and I will explore later, the volatility of some of the firms’ accounts: Firms 311 and 210 both show large drops in 2020 (YE).
Only Firms 210 and 4 are considerably larger than the rest of the top 20% of firms (for this variable).
Firm 4 has appeared twice now, first for having a significantly larger Net Written Premium and now for Equity. This suggests it is a highly profitable insurance firm with a large equity holding, meaning it is in a very healthy financial position. I would suggest this firm is certainly an important player in the market and one to monitor.
| firm | size_score | nwp_m_abs_sd_normalised | net_bel_m_abs_sd_normalised | pure_gross_claims_ratio_abs_sd_normalised |
|---|---|---|---|---|
| Firm 210 | 29.276608 | 87.153007 | 0.6719286 | 0.0048892 |
| Firm 105 | 20.448670 | 51.795731 | 9.5449170 | 0.0053618 |
| Firm 4 | 19.790848 | 59.012878 | 0.3564204 | 0.0032472 |
| Firm 311 | 11.391854 | 33.933678 | 0.2394160 | 0.0024683 |
| Firm 26 | 10.151985 | 29.369312 | 1.0823845 | 0.0042572 |
| Firm 199 | 8.642304 | 25.674708 | 0.2488247 | 0.0033785 |
| Firm 151 | 6.559077 | 19.274517 | 0.4007712 | 0.0019427 |
| Firm 25 | 5.705377 | 16.702546 | 0.4107381 | 0.0028466 |
| Firm 73 | 5.326167 | 15.714369 | 0.2590468 | 0.0050841 |
| Firm 52 | 5.125987 | 1.249713 | 14.1235304 | 0.0047166 |
I have created a table here that provides a potential list of the top ten firms, using only firms that also reported on the general template. I have calculated the score by taking the mean of the normalised absolute standard deviations of each firm’s median value on specific metrics. To explain: I first take the median value of each metric for each firm (the median because the mean is more affected by outliers), then express how far that median sits from the population in absolute standard deviations. This identifies the firms that are, on average, far above the population for each metric. As the distributions differ across metrics, I then normalise each of these scores so that no single metric dominates when I take the mean. I chose Net Written Premium, Net BEL (inc. TPs as a whole, pre-TMTP) and Pure Gross Claims Ratio because, as the correlation matrix above shows, these three variables explain a lot of the variance of the entire dataset, so I felt they were wide-ranging and distinct enough to capture all aspects of the business. As my outputs show, there is an issue of possible outliers that I will consider and explore later.
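For reproducibility, here is one possible implementation of that size score, assuming `df` as above; the exact normalisation used to produce the table may differ (here each metric’s scores are rescaled by their maximum before averaging).

```r
library(dplyr)

size_metrics = c("nwp_m", "net_bel_m", "pure_gross_claims_ratio")

size_scores = df %>%
  filter(value_type == "time_series_value") %>%
  group_by(firm) %>%
  # median over the 2016-20 series (the median is less affected by outliers)
  summarise(across(all_of(size_metrics), ~ median(.x, na.rm = TRUE))) %>%
  # distance from the population mean, in absolute standard deviations
  mutate(across(all_of(size_metrics),
                ~ abs(.x - mean(.x, na.rm = TRUE)) / sd(.x, na.rm = TRUE),
                .names = "{.col}_abs_sd")) %>%
  # rescale each metric's scores so no single metric dominates the average
  mutate(across(ends_with("_abs_sd"), ~ .x / max(.x, na.rm = TRUE),
                .names = "{.col}_normalised")) %>%
  mutate(size_score = rowMeans(across(ends_with("_normalised")))) %>%
  arrange(desc(size_score)) %>%
  select(firm, size_score, ends_with("_normalised"))
```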
The volatility of a firm’s accounts is of considerable concern to supervisors, so it is important to investigate.
I will start by considering variation at the metric level. I am going to use the relative standard deviation (RSD), as this allows me to compare standard deviations across variables that sit on very different scales.
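A sketch of how the RSD series behind the next chart could be computed, assuming `df` as above (RSD taken as the standard deviation divided by the absolute mean across firms, per metric and year).

```r
library(dplyr)
library(tidyr)

rsd_df = df %>%
  filter(value_type == "time_series_value") %>%
  # long format: one row per firm, year and metric
  pivot_longer(cols = -c(firm, year, value_type), names_to = "metric") %>%
  group_by(metric, year) %>%
  summarise(
    rsd = sd(value, na.rm = TRUE) / abs(mean(value, na.rm = TRUE)),
    .groups = "drop"
  )
```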
This chart shows that RSD varies across metrics over time and that several metrics have high variance. Six metrics stand out with a high relative standard deviation, all of which are ratios: gross_expense_ratio, net_expense_ratio, gross_combined_ratio, net_combined_ratio, pure_gross_claims_ratio and pure_net_claims_ratio. Although there are six, they form three highly correlated net/gross pairs. There are also strong relationships by definition: net_combined_ratio is \(\frac{\text{Sum of net claims and expenses incurred}}{\text{Net earned premium}}\), while net_expense_ratio is just the expenses part of that equation, suggesting the variance could be driven by expenses. We also see a massive spike in 2016 in the RSD of gross_expense_ratio, which I suspect is misreporting, especially given there is no corresponding spike in net_expense_ratio; I will explore this later. I will look into the firms that produced these high variations in the net variables, as these will likely also explain the variations in the gross variables.
We can see from this graph that the net_expense_ratio spike has potentially come from two enormous reported values. I will add a filter to this chart, and to the other ratio charts, to keep only firms with median values between 0 and 0.87 (the upper bound set by Q3 + 1.5 × IQR). I will be consistent in flagging and filtering outliers in this way.
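A sketch of that filter, assuming `df` as above; here the bound is recomputed from the population of firm medians rather than hard-coded at 0.87.

```r
library(dplyr)

# one row of pre-computed medians per firm
ratio_medians = df %>%
  filter(value_type == "median_value") %>%
  distinct(firm, .keep_all = TRUE)

# upper bound = Q3 + 1.5 * IQR of the population of firm medians
upper_bound = quantile(ratio_medians$net_expense_ratio, 0.75, na.rm = TRUE) +
  1.5 * IQR(ratio_medians$net_expense_ratio, na.rm = TRUE)

# firms kept for the ratio charts
kept_firms = ratio_medians %>%
  filter(between(net_expense_ratio, 0, upper_bound)) %>%
  pull(firm)
```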
After filtering out the misreporting firms, we can see the five firms whose data has changed most substantially over the period 2016-20. This is important for supervisors to be aware of because such volatility can damage investors’ confidence in a firm and undermine its ability to conduct any reasonable long-term financial planning.
As Firm 418 has misreported both net_combined_ratio and net_expense_ratio, we can assume the issue is a misreporting of its expenses, which has had a knock-on impact on several of its larger KPIs.
Again, once these are removed, we can see multiple firms with volatile ratios. Firms 287, 291 and 417, having both a volatile and a high net combined ratio, would be a cause for concern for supervisors.
This ratio is calculated as \(\frac{\text{Net claims}}{\text{Net earned premium}}\), so a firm like Firm 38, which has managed to reduce it from a poor position, is a positive story and something that would allay supervisory concern.
I’ve shown a couple of techniques I have used to filter out outliers and highlight elements of the time series. I have chosen the interquartile range method because of the presence of extreme outliers in the data; the standard deviation is influenced much more strongly by extreme values. I have created a flagging system: if a data point falls outside \((Q1 - 1.5 \times IQR, Q3 + 1.5 \times IQR)\) for the entire metric population it is an ‘IQR Outlier’; if it falls below the 2nd decile or above the 8th decile it is labelled a ‘Decile Outlier’; otherwise it gets ‘No Flag’. I will provide an example below for Firm 12.
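A sketch of that flagging logic as a reusable helper; the function name is illustrative, and the example applies it to scr_m across the whole population.

```r
library(dplyr)

# flag each value against the population of the same metric
flag_outliers = function(x) {
  q1 = quantile(x, 0.25, na.rm = TRUE)
  q3 = quantile(x, 0.75, na.rm = TRUE)
  iqr = q3 - q1
  d2 = quantile(x, 0.2, na.rm = TRUE)  # 2nd decile
  d8 = quantile(x, 0.8, na.rm = TRUE)  # 8th decile
  case_when(
    x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr ~ "IQR Outlier",
    x < d2 | x > d8                         ~ "Decile Outlier",
    TRUE                                    ~ "No Flag"
  )
}

# e.g. flag every firm-year scr_m value against the population
df %>%
  filter(value_type == "time_series_value") %>%
  mutate(scr_m_flag = flag_outliers(scr_m)) %>%
  select(firm, year, scr_m, scr_m_flag)
```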
We can see there is fair homogeneity in submissions, and there are no systemic reporting issues towards any particular metric.
This approach allows users to flag submissions quickly: if we take Firm 139’s scr_m and Firm 222’s nwp_m submissions, we can see that they have a range of flags throughout the series.
| firm | year | scr_m | nwp_m |
|---|---|---|---|
| Firm 139 | 2016 | 148.44436 | 700.60412 |
| Firm 139 | 2017 | 102.00252 | 678.88901 |
| Firm 139 | 2018 | 384.08633 | 1013.57043 |
| Firm 139 | 2019 | 102.85466 | 208.04965 |
| Firm 139 | 2020 | 137.17073 | 468.09469 |
| Firm 222 | 2016 | 157.07521 | 223.59939 |
| Firm 222 | 2017 | 92.17920 | 423.15529 |
| Firm 222 | 2018 | 286.40581 | 291.80749 |
| Firm 222 | 2019 | 66.80772 | 153.44594 |
| Firm 222 | 2020 | 78.87480 | 79.30072 |
If we graph these submissions, we can see the following:
So this method allows us to look at each firm’s submission against the rest of the population and detect whether we consider it an outlier or not. In this case we can safely assume the IQR Outlier will require a resubmission.
For the machine learning calculations, I am going to remove all IQR outliers to help with explainability.
filtered_df = dplyr::anti_join(
  x = df %>%
    dplyr::filter(value_type == 'time_series_value') %>%
    tidyr::pivot_longer(cols = -c('firm', 'year', 'value_type')),
  y = outliers_df %>%
    tidyr::pivot_longer(cols = -c('firm', 'year')) %>%
    dplyr::filter(value == 'IQR Outlier'),
  by = c('firm', 'year')
) %>%
  # we are going to describe the whole time series, so we just need firm and metric
  dplyr::group_by(firm, name) %>%
  dplyr::summarise(
    # these are the variables I will use (for now)
    median_value = median(value, na.rm = T)
    # mad_value = mad(value, na.rm = T)
  ) %>%
  tidyr::pivot_wider(
    names_from = name,
    values_from = median_value)

kable(head(filtered_df)) %>%
  kable_styling("striped", full_width = F)
| firm | eof_for_scr_m | equity_m | gross_bel_m | gross_claims_incurred_m | gross_combined_ratio | gross_expense_ratio | gwp_m | net_bel_m | net_combined_ratio | net_expense_ratio | nwp_m | pure_gross_claims_ratio | pure_net_claims_ratio | scr_coverage_ratio | scr_m | total_assets_m | total_liabilities_m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Firm 1 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0.000000 | 0.000000 | 0.00000 | 0.0000000 |
| Firm 101 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0.000000 | 0.000000 | 0.00000 | 0.0000000 |
| Firm 102 | 0.00000 | 0.000000 | -22.240795 | 157.288903 | 1.428893 | 0.5209757 | 0 | -14.5213022 | 1.0620728 | 0.5463933 | 0 | 0.6224033 | 0.3246686 | 0.000000 | 0.000000 | 0.00000 | 0.0000000 |
| Firm 103 | 18.47424 | 47.916448 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 2.595551 | 1.123171 | 64.46984 | 0.6099226 |
| Firm 106 | 0.00000 | 0.000000 | 7.185743 | 1.579704 | 0.000000 | 0.0000000 | 0 | 0.1947479 | 0.0000000 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0.000000 | 0.000000 | 0.00000 | 0.0000000 |
| Firm 108 | 12.98278 | 4.660648 | 2.386303 | 41.942028 | 0.338826 | 0.0501317 | 0 | 3.3149043 | 0.6461881 | 0.0183560 | 0 | 0.3933908 | 0.9932665 | 3.596400 | 1.049790 | 12.44512 | 16.7744278 |
One of the strengths of the PRA’s supervisory infrastructure is that we are able to categorise firms based on their size, business model, area of operation and so on. Each of these characteristics requires its own supervisory approach, which the PRA can adapt to, enabling more effective regulation. For my machine learning application I will look to cluster the firms into distinct categories, using principal component analysis (PCA) as a pre-processing step, so that we can better define them and our approach to them. I have chosen PCA before clustering because, as a dimension-reduction technique, it extracts the most important elements of each feature and reduces the dataset to a more manageable size. Given the multicollinearity problems shown at the start, this approach will mitigate that issue and improve cluster quality. With tidymodels we can easily create a pipeline from input to PCA output.
pca_recipe <- recipes::recipe(firm ~ ., data = filtered_df) %>%
  # drop any numeric columns that contain missing values
  recipes::step_filter_missing(
    all_numeric(),
    threshold = 0) %>%
  # centre and scale all numeric columns
  recipes::step_normalize(all_numeric()) %>%
  # compute enough principal components to capture 95% of the variance
  recipes::step_pca(
    all_numeric(),
    threshold = .95
  )

# estimate the recipe steps on the filtered data
pca_prep = recipes::prep(x = pca_recipe, training = filtered_df)

# apply the recipe to obtain the principal component scores for each firm
pca_output_df = recipes::bake(object = pca_prep, filtered_df)

kable(head(pca_output_df)) %>%
  kable_styling("striped", full_width = F)
| firm | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|---|
| Firm 1 | -1.617780 | 0.0828767 | -0.0006132 | -0.0815506 | 0.0299363 | 0.0090921 | -0.0247785 | 0.0283331 |
| Firm 101 | -1.617780 | 0.0828767 | -0.0006132 | -0.0815506 | 0.0299363 | 0.0090921 | -0.0247785 | 0.0283331 |
| Firm 102 | 4.860253 | -2.8471428 | 1.3587408 | -3.5633653 | 0.1309897 | 0.0477675 | 0.5332307 | 0.4841562 |
| Firm 103 | -1.617780 | 0.0828767 | -0.0006132 | -0.0815506 | 0.0299363 | 0.0090921 | -0.0247785 | 0.0283331 |
| Firm 106 | -1.566940 | 0.1563090 | 0.0361121 | -0.1119333 | -0.0219634 | 0.0247992 | 0.0113430 | 0.0330784 |
| Firm 108 | 1.609362 | -0.6238122 | -2.2371248 | -1.0468584 | 0.5804565 | 1.0327301 | -0.5352348 | 1.0128553 |
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 |
|---|---|---|---|---|---|---|---|---|---|
| Standard deviation | 2.345756 | 0.9705261 | 0.7830859 | 0.6875206 | 0.6415613 | 0.5592552 | 0.5286541 | 0.4974662 | 0.46721 |
| Proportion of Variance | 0.611400 | 0.1046600 | 0.0681400 | 0.0525200 | 0.0457300 | 0.0347500 | 0.0310500 | 0.0275000 | 0.02425 |
| Cumulative Proportion | 0.611400 | 0.7160500 | 0.7841900 | 0.8367100 | 0.8824400 | 0.9172000 | 0.9482500 | 0.9757500 | 1.00000 |
This provides us with eight principal components that together explain over 95% of the total variance of the dataset.
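For reference, the explained-variance table above can be pulled from the prepped recipe; a brief sketch, assuming step_pca is the third step of pca_recipe as defined above.

```r
# the underlying prcomp object sits inside the prepped step_pca
pca_fit = pca_prep$steps[[3]]$res

# Standard deviation / Proportion of Variance / Cumulative Proportion table
summary(pca_fit)$importance
```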
I am now going to plot PC1 against PC2 to see if there are obvious clusters before performing k-means clustering.
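A sketch of that plot using geom_bin2d, assuming pca_output_df as baked above.

```r
library(ggplot2)

ggplot(pca_output_df, aes(x = PC1, y = PC2)) +
  # bin the firms into 2-d rectangles to show where they are concentrated
  geom_bin2d(bins = 40) +
  labs(title = "Firms in the space of the first two principal components")
```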
There are few discernible groups using just PCs 1 and 2, even with ggplot2’s geom_bin2d function to highlight dense regions. I will use k-means clustering, an unsupervised machine learning technique that groups points by minimising within-cluster variance; it is an efficient and explainable algorithm, and the latter point is important when sharing findings with non-technical colleagues.
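A minimal k-means sketch on the principal component scores; the number of centres (four) and the seed are illustrative, and k would normally be chosen with an elbow or silhouette check.

```r
set.seed(123)  # k-means starts are random, so fix the seed for reproducibility

# cluster firms on their principal component scores
km_fit = kmeans(dplyr::select(pca_output_df, -firm), centers = 4, nstart = 25)

# attach the cluster labels back to the firms and inspect cluster sizes
clustered_df = dplyr::mutate(pca_output_df, cluster = factor(km_fit$cluster))
table(clustered_df$cluster)
```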